Clustering and classification

This week I learned about clustering and classification: how to cluster observations, how to study which factors affect or justify a clustering, and how to choose an appropriate number of clusters.

2. Data input and summary

This week’s data comes from an R package called MASS:

# access the MASS package
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The dataset consists of 506 observations of 14 variables. All variables are numerical; one of them (‘chas’) is a 1/0 dummy variable that indicates whether the tract bounds the Charles River. The variables describe housing values in suburbs of Boston and factors measured in those suburbs that are thought to be related to housing values. The factors include, for example, the per capita crime rate, nitrogen oxides concentration, average number of rooms per dwelling, weighted distances to five Boston employment centres, accessibility to radial highways and a transformed measure of the proportion of Black residents by town (‘black’). The housing value itself is ‘medv’, the median value of owner-occupied homes. The full details are in the dataset’s help page (?Boston).

3. Data exploration

Let’s explore graphically the distributions and relations of the data:

pairs(Boston)

This plot is difficult to read with 14 variables, and I’ll figure out later how to improve the output. The summary table above, however, already lets me explore the variation and distributions of the variables.
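
One possible way to make the scatterplot matrix more readable is to plot only a handful of variables with small, semi-transparent points. This is a minimal sketch: the six variables below are an arbitrary choice for illustration, not a selection made in the exercise.

```r
library(MASS)  # provides the Boston data

# arbitrary subset of variables, chosen only for illustration
vars <- c("crim", "nox", "rm", "dis", "lstat", "medv")

# smaller, semi-transparent points reduce overplotting
pairs(Boston[, vars],
      pch = 16, cex = 0.4,
      col = rgb(0, 0, 0, alpha = 0.3),
      gap = 0.3)
```

With fewer panels each scatterplot gets enough room to show its shape.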

Here are two dot charts of the variables ‘crim’ (per capita crime rate by town) and ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft.), which show that both variables are strongly skewed: most values are small and a few are very large.

dotchart(Boston$crim)

dotchart(Boston$zn)
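
The skewness can also be checked numerically. The sketch below compares mean and median and counts the share of exact zeros; the helper name `skew_summary` and the log-scale histogram are assumptions added for illustration.

```r
library(MASS)

# mean far above the median is a sign of right skew
skew_summary <- data.frame(
  variable   = c("crim", "zn"),
  mean       = c(mean(Boston$crim), mean(Boston$zn)),
  median     = c(median(Boston$crim), median(Boston$zn)),
  share_zero = c(mean(Boston$crim == 0), mean(Boston$zn == 0))
)
print(skew_summary)

# crim is strictly positive, so a log scale spreads out the small values
hist(log10(Boston$crim), breaks = 30,
     main = "log10 of per capita crime rate", xlab = "log10(crim)")
```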

Some variables seem correlated. Here’s a correlation matrix of the variables:

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
library(magrittr)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

The matrix above is not very readable because the printout wraps into two blocks. Let’s present the correlations in a nicer way.

# visualize the correlation matrix
corrplot(cor_matrix, method="circle",type="upper",cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

This plot is easier to read. The larger the circle, the stronger the correlation between the two variables. Red indicates negative correlation and blue indicates positive correlation.
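
To complement the plot, the strongest pairs can be listed directly from the correlation matrix. This is a small sketch; the names `pairs_df` and `top_pairs` are made up for the example.

```r
library(MASS)

cor_matrix <- round(cor(Boston), 2)

# keep each pair once by taking the upper triangle only
upper <- upper.tri(cor_matrix)
pairs_df <- data.frame(
  var1 = rownames(cor_matrix)[row(cor_matrix)[upper]],
  var2 = colnames(cor_matrix)[col(cor_matrix)[upper]],
  r    = cor_matrix[upper]
)

# five pairs with the largest absolute correlation
top_pairs <- pairs_df[order(-abs(pairs_df$r)), ][1:5, ]
print(top_pairs)
# in the matrix printed above, the strongest pair is rad/tax (0.91)
```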

4. Standardize dataset

The variables are measured on very different scales (for example, ‘tax’ ranges from 187 to 711 while ‘nox’ stays below 1). I scale all variables because later on it would be difficult to sum, average or compare distances across variables that are on different scales. Scaling can be applied to every variable in the dataset as they are all numerical.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)

Now all the variables have a mean of zero and a standard deviation of one, so they are on a comparable scale.
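
What scale() does by default can be verified by hand: each column is centered on its mean and divided by its standard deviation, $z = (x - \bar{x}) / s$. A minimal check, with `manual_crim` as a made-up name:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# standardize one column manually and compare with scale()
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
print(all.equal(boston_scaled$crim, manual_crim))

# every scaled column should have standard deviation 1
print(round(sapply(boston_scaled, sd), 2))
```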

4. Create a categorical variable

Next I create a categorical variable from the scaled crime rate in the Boston dataset, using its quartiles as the break points. I then drop the old crime rate variable from the dataset.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
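
The include.lowest = TRUE argument matters here, because cut() builds intervals that are open on the left by default. The sketch below shows the effect; `with_lowest` and `without_lowest` are illustrative names.

```r
library(MASS)

crim_scaled <- as.numeric(scale(Boston$crim))
bins <- quantile(crim_scaled)
labs <- c("low", "med_low", "med_high", "high")

# the smallest value equals the first break point
with_lowest    <- cut(crim_scaled, breaks = bins, include.lowest = TRUE, labels = labs)
without_lowest <- cut(crim_scaled, breaks = bins, labels = labs)

# without include.lowest the minimum falls outside (b1, b2] and becomes NA
print(sum(is.na(with_lowest)))
print(sum(is.na(without_lowest)))
print(table(with_lowest))
```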

4. Divide the data into training and testing sets

For later model evaluation purposes I divide the dataset into training and testing datasets, so that 80% of the data belongs to the train set:

# divide the data into training and testing sets

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
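
Because sample() is random, every run of the code above produces a different split. If a repeatable split is wanted, a seed can be set first. This is a sketch under that assumption; the seed value 2024 is arbitrary and the scaled data is rebuilt here without the crime factor to keep the example short.

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

set.seed(2024)  # arbitrary seed, only to make the split repeatable

n <- nrow(boston_scaled)
# n * 0.8 is 404.8, so floor() makes the sample size explicit
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

print(c(train = nrow(train), test = nrow(test)))
```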

5. Fit a linear discriminant analysis on the train set.

Next I want to know which variables might explain the target variable crime rate. I do a linear discriminant analysis with the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables:

# linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2351485 0.2475248 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9689173 -0.8775882 -0.15765625 -0.8684159  0.3642782
## med_low  -0.1151118 -0.2820692  0.01777305 -0.5837490 -0.1105081
## med_high -0.3834775  0.1342090  0.20012296  0.3774234  0.1712535
## high     -0.4872402  1.0170298 -0.04947434  1.0384506 -0.4156729
##                 age        dis        rad        tax     ptratio
## low      -0.8529600  0.8634041 -0.6897369 -0.7265506 -0.46558251
## med_low  -0.3731723  0.3549344 -0.5478716 -0.4590917 -0.09177667
## med_high  0.4107839 -0.3771339 -0.3789260 -0.3075711 -0.28108495
## high      0.8272262 -0.8638605  1.6390172  1.5146914  0.78181164
##               black      lstat        medv
## low       0.3733517 -0.7272796  0.45109276
## med_low   0.3448706 -0.1576222  0.03569018
## med_high  0.0802057  0.0125523  0.20269355
## high     -0.8225573  0.9025471 -0.70077172
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12599291  0.772806223 -1.01479849
## indus   -0.02212879 -0.198951275  0.33877198
## chas    -0.07221584 -0.078038290  0.13634154
## nox      0.37417400 -0.745174006 -1.36124805
## rm      -0.09899783 -0.155353828 -0.10341794
## age      0.34305175 -0.289932309 -0.26384134
## dis     -0.10993015 -0.290975763  0.19578829
## rad      3.03605989  0.894856162 -0.16228546
## tax     -0.03794671  0.008382812  0.73117889
## ptratio  0.10257675 -0.018590069 -0.31404163
## black   -0.15166725  0.017962832  0.14633542
## lstat    0.16675360 -0.259427485  0.46366379
## medv     0.13793487 -0.440892859 -0.08619663
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9461 0.0392 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)

6. Predict

Next I want to use the observations in the test set to predict crime classes. I do this because I want to estimate the “goodness” of my model by comparing predictions to observed “real” data.

For prediction I use the LDA model on the test data. For comparison I tabulate the results with the crime categories from the test set:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       9        1    0
##   med_low    5      17        9    0
##   med_high   0       5       21    0
##   high       0       0        0   21

I ran the random train/test split and the prediction twice. The first run gave a fairly poor result, with more than half of the med_high cases predicted incorrectly. The second run, shown here, looks better: some classes are still confused, mostly with their neighbouring classes, but most predictions are correct and every high case is classified correctly. Because the split is random and no seed is set, the numbers change from run to run.
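
The cross tabulation can be turned into summary numbers. The sketch below re-enters the table printed above by hand (so it does not depend on the random split) and computes the overall accuracy and the share of correct predictions per true class; `conf_mat` and `per_class` are illustrative names.

```r
# confusion matrix copied from the table above (rows = correct, columns = predicted)
lv <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(14,  9,  1,  0,
                      5, 17,  9,  0,
                      0,  5, 21,  0,
                      0,  0,  0, 21),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lv, predicted = lv))

# overall accuracy: correct predictions on the diagonal over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(round(accuracy, 3))   # 73 of 102, about 0.716

# share of each true class that was predicted correctly
per_class <- diag(conf_mat) / rowSums(conf_mat)
print(round(per_class, 3))
```

The per-class shares make it visible that the errors concentrate in the low and med_low classes.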

7. K-means clustering

Next I study the Boston data without any predefined classes and try to cluster the observations into groups. Maybe the suburbs form natural clusters. I run the k-means algorithm on the dataset, investigate the optimal number of clusters and then run the algorithm again.

First I take the original Boston dataset again and standardize it. Then I calculate the Euclidean distances between the observations and present a summary of the distances:

#standardize the data set
boston_scaled2 <- scale(Boston)

# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2<-as.data.frame(boston_scaled2)

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
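
For reference, the Euclidean distance between two observations $x$ and $y$ is $d(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$, summed over all 14 scaled variables. A small sketch that checks this against dist() for the first two rows (`manual` and `from_dist` are illustrative names):

```r
library(MASS)

boston_scaled2 <- as.data.frame(scale(Boston))

# first two observations as plain numeric vectors
x1 <- unlist(boston_scaled2[1, ])
x2 <- unlist(boston_scaled2[2, ])

# distance by the formula and by dist()
manual    <- sqrt(sum((x1 - x2)^2))
from_dist <- as.matrix(dist(boston_scaled2[1:2, ]))[1, 2]

print(all.equal(manual, from_dist))
```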

Next I run the k-means clustering with 3 centers.

# k-means clustering
km <-kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2[9:14], col = km$cluster)

I zoomed in to various parts of the plot and found that for the variable ‘tax’ the clusters are clearly visible: at least the observations drawn in black (cluster 1) form their own group.

I also explored the clustering with 5 centers. The grouping seemed even more arbitrary.

Now, I’m not sure about the best number of clusters, so I compute the total within-cluster sum of squares (WCSS) and see how it behaves when the number of clusters changes:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
# set values
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The total WCSS drops most sharply when moving from one to two clusters, which suggests that two is a reasonable number of clusters for this dataset.
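
Since k-means starts from random centers, a single run per k can land in a poor local optimum. A possible refinement, sketched here as an assumption rather than as part of the exercise, is to use several random starts per k (the nstart argument of kmeans()) and to look at how much the WCSS drops at each step; `drops` is a made-up name.

```r
library(MASS)

set.seed(123)
boston_scaled2 <- as.data.frame(scale(Boston))
k_max <- 10

# best of 20 random starts for each number of clusters
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled2, centers = k, nstart = 20)$tot.withinss
})

# decrease in WCSS when going from k to k + 1 clusters
drops <- -diff(twcss)
print(round(drops, 1))

# for k = 1 the WCSS is the total sum of squares: 14 variables * (506 - 1) = 7070
plot(1:k_max, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```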

I run the clustering again with 2 centers:

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster)

Now the clustering seems better, at least for some variable pairs. But in my opinion, having only two groups doesn’t tell much. Maybe it suggests that the suburbs of Boston are divided into two groups, the wealthy and the poor?
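
One way to probe the wealthy/poor idea is to compare the cluster means of a few variables on their original scale. This is only a sketch: the choice of ‘medv’, ‘lstat’, ‘crim’ and ‘tax’ and the name `profile` are assumptions for illustration.

```r
library(MASS)

set.seed(123)
boston_scaled2 <- as.data.frame(scale(Boston))
km <- kmeans(boston_scaled2, centers = 2, nstart = 20)

# cluster means of selected variables, computed on the unscaled data
profile <- aggregate(Boston[, c("medv", "lstat", "crim", "tax")],
                     by = list(cluster = km$cluster),
                     FUN = mean)
print(profile)
print(km$size)
```

If one cluster shows clearly lower ‘medv’ and higher ‘lstat’ than the other, that would be consistent with the interpretation above.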

Bonus:

Next I perform the LDA again on the Boston dataset, this time with the k-means clusters (3 centers) as the target variable. By visualizing the results with a biplot I can interpret which variables influence the clustering.

boston_scaled3<-boston_scaled2

# k-means clustering
km <-kmeans(boston_scaled3, centers = 3)

klusteri<-km$cluster
class(klusteri)
## [1] "integer"
boston_scaled3<-cbind(boston_scaled3,klusteri)
summary(boston_scaled3)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv            klusteri    
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:1.000  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449   Median :2.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :1.972  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:3.000  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865   Max.   :3.000
# linear discriminant analysis
lda.fit2 <- lda(klusteri~., data = boston_scaled3)

# print the lda.fit object
lda.fit2
## Call:
## lda(klusteri ~ ., data = boston_scaled3)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3003953 0.4268775 0.2727273 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm
## 1  0.8942488 -0.4872402  1.0913679 -0.01330932  1.1109351 -0.4609873
## 2 -0.3688324 -0.3935457 -0.1369208  0.07398993 -0.1662087 -0.1700456
## 3 -0.4076669  1.1526549 -0.9877755 -0.10115080 -0.9634859  0.7739125
##          age         dis        rad        tax     ptratio      black
## 1  0.7828949 -0.84882600  1.3656860  1.3895093  0.63256391 -0.7083974
## 2  0.1673019 -0.07766431 -0.5799077 -0.5409630 -0.04596655  0.2680397
## 3 -1.1241828  1.05650031 -0.5965522 -0.6837494 -0.62478941  0.3607235
##         lstat        medv
## 1  0.90799414 -0.69550394
## 2 -0.05818052 -0.04811607
## 3 -0.90904433  0.84137443
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## crim     0.043702606  0.16161136
## zn       0.049248495  0.76920932
## indus   -0.331498698  0.02870425
## chas    -0.012406954 -0.11314905
## nox     -0.721972554  0.40566595
## rm       0.174541989  0.41632858
## age      0.006221178 -0.88117192
## dis      0.043869924  0.36910493
## rad     -1.256861546  0.47665247
## tax     -0.992855786  0.46457291
## ptratio -0.092336951 -0.01003010
## black    0.073915653 -0.03513128
## lstat   -0.372145848  0.38403679
## medv    -0.058153798  0.49571753
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8785 0.1215
# reuse the lda.arrows() function defined in section 5

# target classes as numeric
classes <- as.numeric(boston_scaled3$klusteri)

# plot the lda results
plot(lda.fit2, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit2, myscale = 1)

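From these results I would interpret that the variable ‘rad’ (index of accessibility to radial highways) is the strongest linear separator in this dataset: it has the largest LD1 coefficient in absolute value (-1.26). Several other variables follow not far behind, notably ‘tax’ (-0.99) and ‘nox’ (-0.72) on LD1, and ‘age’ (-0.88) and ‘zn’ (0.77) on LD2.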
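
The ranking can also be read off programmatically instead of from the biplot. The sketch below refits the model with a fixed seed and sorts the variables by the absolute value of their LD1 coefficient. The seed, nstart = 20 and the name `ld1_rank` are assumptions, so the clusters and coefficients need not match the output above exactly.

```r
library(MASS)

set.seed(123)
boston_scaled3 <- as.data.frame(scale(Boston))

# k-means clusters as the target variable, as in the text
km <- kmeans(boston_scaled3, centers = 3, nstart = 20)
boston_scaled3$klusteri <- km$cluster

lda.fit2 <- lda(klusteri ~ ., data = boston_scaled3)

# variables ordered by the size of their LD1 coefficient
ld1_rank <- sort(abs(coef(lda.fit2)[, "LD1"]), decreasing = TRUE)
print(round(head(ld1_rank, 5), 2))
```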

Super-Bonus:

Next I’ll draw some 3D plots of the training data:

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Set the color to be the crime classes of the train set. Draw another 3D plot where the
# color is defined by the clusters of the k-means. How do the plots differ?

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color=train$crime)  

I stop the exercise here because I’m having trouble understanding the instructions. I’m able to draw one 3D plot, in which the crime classes seem to separate the observations clearly. The second plot should show the division by clusters, but I’m no longer sure whether I should run the k-means clustering again on the training data and then change the code, or whether modifying the color argument alone would be enough. I leave it here.
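
One possible reading of the instruction is sketched below: run k-means on the same training predictors that are projected onto the discriminants, so that the cluster vector has one entry per plotted point, and pass it to the color argument. The seed, the choice of 3 centers and the names `km_train` and `cluster` are assumptions, not part of the exercise.

```r
library(MASS)

set.seed(123)  # assumed seed, only so the sketch is repeatable

# rebuild the scaled data with the categorical crime variable
boston_scaled <- as.data.frame(scale(Boston))
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# 80/20 split and LDA on the training set
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# project the training predictors onto the three discriminants
model_predictors <- train[, setdiff(names(train), "crime")]
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same rows, so clusters line up with the plotted points
km_train <- kmeans(model_predictors, centers = 3, nstart = 20)
cluster <- factor(km_train$cluster)

# build the plot only if plotly is installed; evaluate p to display it
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                       z = matrix_product$LD3, type = "scatter3d",
                       mode = "markers", color = cluster)
}
```

Under this reading, the two plots share the same point positions and differ only in coloring, which makes it possible to see how far the k-means clusters agree with the crime classes.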